Overview/Motivation

The purpose of this project is to examine the Hospital Consumer Assessment of Healthcare Providers and Systems (HCAHPS) and related healthcare quality and patient safety metrics that capture patient’s overall experience at a hospital.

As part of the Affordable Care Act, the government created a national, standardized, publicly reported survey of the patient’s experience by hospital (i.e. the HCAHPS). The HCAHPS contains qualitative measures of a patient’s experience including cleanliness of the hospital, pain management, effective communication by doctors and nurses, etc. HCAHPS Scores are based on surveys sent to patients by the Centers for Medicaid and Medicare (CMS). A random sampling of patients receive surveys from CMS. The survey responses are controlled for a number of factors including demographics and hospital location. CMS publishes a star rating for each of the survey questions and for an overall score, the highest rating a hospital can get is 5 stars and the lowest is 1 star.The survey questions are primarily centered around patient satisfaction and quality of service in the hospital, such as: how well did your doctor communicate with you? Did the staff explain the medications you were given? Was your pain well managed? Was it quiet?

The survey was created not only to help gather data on patient’s experiences and help hospitals improve their practices, but will also be used in Medicare reimbursement calculations. As one can imagine, this has led to some engaged discussions on the value of factoring in patient’s perception of care vs. more quantitative measures of care (e.g. survival rates, infection rates, etc).

This project was inspired by this question of the relationship between the HCAHPS’s subjective ratings and other quantitative measures of patient experience. Our approach to the answer is to create a predictive model of star ratings for NY hospitals. First, we will look at other metrics within HCAHPS that can be leveraged within our model. Then, we will explore other hospital/Medicare databases and see whether metrics such as cost, mortality rates, infection rates, readmission rates, and waiting times are also predictive of star ratings. Along the way, we hope to answer some of the following questions:

The sources of our data can be found below:

HCAHPS: https://data.medicare.gov/Hospital-Compare/Patient-survey-HCAHPS-Hospital/dgck-syfz

Cost-related: https://data.medicare.gov/Hospital-Compare/Payment-and-value-of-care-Hospital/c7us-v4mf https://health.data.ny.gov/Health/Hospital-Inpatient-Cost-Transparency-Beginning-200/7dtz-qxmr

Complications/Infections: https://health.data.ny.gov/Health/Cardiac-Surgery-and-Percutaneous-Coronary-Interven/jtip-2ccj https://health.data.ny.gov/Health/Hospital-Acquired-Infections-Beginning-2008/utrt-zdsi https://data.medicare.gov/Hospital-Compare/Complications-Hospital/632h-zaca https://data.medicare.gov/Hospital-Compare/Hospital-Readmissions-Reduction-Program/9n3s-kdb3 https://data.medicare.gov/Hospital-Compare/Readmissions-Complications-and-Deaths-Hospital/7xux-kdpw

Service: https://data.medicare.gov/Hospital-Compare/Timely-and-Effective-Care-Hospital/yv7e-xc69

Data Wrangling

Load files from github repo:

library(broom)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.4
library(knitr)
library(tidyr)
library(gplots)
## Warning: package 'gplots' was built under R version 3.2.4
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#cost sets
paymentandvalue <- read.csv("https://raw.githubusercontent.com/anniey/Hospital-Ranking-CS107/master/Raw%20Data/Payment_and_value_of_care_-_Hospital.csv") #national
inpatientcost <- read.csv("https://raw.githubusercontent.com/anniey/Hospital-Ranking-CS107/master/Raw%20Data/Hospital_Inpatient_Cost_Transparency__Beginning_2009.csv") #NYS

#risk of badness sets
PCIbyhospital <- read.csv("https://raw.githubusercontent.com/anniey/Hospital-Ranking-CS107/master/Raw%20Data/Cardiac_Surgery_and_Percutaneous_Coronary_Interventions_by_Hospital___Beginning_2008.csv") #NYS

complicationshospital <- read.csv("https://raw.githubusercontent.com/anniey/Hospital-Ranking-CS107/master/Raw%20Data/Readmissions_Complications_and_Deaths_-_Hospital.csv") #national

#service
hcahps <- read.csv("https://raw.githubusercontent.com/anniey/Hospital-Ranking-CS107/master/Raw%20Data/hcahps.csv") #national
timely <- read.csv("https://raw.githubusercontent.com/anniey/Hospital-Ranking-CS107/master/Raw%20Data/Timely_and_Effective_Care_-_Hospital.csv") #national

#readmisisons
readmission <- read.csv("https://raw.githubusercontent.com/anniey/Hospital-Ranking-CS107/master/Raw%20Data/Hospital_Readmissions_Reduction_Program.csv") #national 

#infections
hospitalinfection <- read.csv("https://raw.githubusercontent.com/anniey/Hospital-Ranking-CS107/master/Raw%20Data/Hospital-Acquired_Infections__Beginning_2008.csv") #ny

Our datasets are publicly available from either New York State or the federal government. New York State and the federal government have different naming conventions and don’t share a common identifier for hospitals. In order to get comparable data from the two sources we need to wrangle the data.

Firstly, let’s limit national data sets to New York State only:

paymentandvalue <- paymentandvalue %>% filter(State=="NY")
complicationshospital <- complicationshospital %>% filter(State=="NY")
hcahps <- hcahps %>% filter(State=="NY")
timely <- timely %>% filter(State=="NY")
readmission <- readmission %>% filter(State=="NY")

Additional wrangling of factor to numeric

#change to numeric
complicationshospital$Denominator <- as.numeric(as.character(complicationshospital$Denominator))
complicationshospital$Score <- as.numeric(as.character(complicationshospital$Score))

#strip out the special characters and convert to numeric
inpatientcost$Mean.Charge <- sub('\\$','', as.character(levels(inpatientcost$Mean.Charge))[inpatientcost$Mean.Charge]) %>%
  as.numeric()

inpatientcost$Median.Charge <- sub('\\$','', as.character(levels(inpatientcost$Median.Charge))[inpatientcost$Median.Charge]) %>%
  as.numeric()

paymentandvalue$Payment <- sub('\\$','', as.character(levels(paymentandvalue$Payment))[paymentandvalue$Payment]) %>%
  as.numeric()

paymentandvalue$Denominator <- sub('\\,','', as.character(levels(paymentandvalue$Denominator))[paymentandvalue$Denominator]) %>%
  as.numeric()

timely$Score <- as.character(levels(timely$Score))[timely$Score] %>%
  as.numeric()

#change provider ID to numeric
timely$Provider.ID <- as.character(levels(timely$Provider.ID))[timely$Provider.ID] %>%
  as.numeric()

complicationshospital$Provider.ID <- as.character(levels(complicationshospital$Provider.ID))[complicationshospital$Provider.ID] %>%
  as.numeric()

paymentandvalue$Provider.ID <- as.character(levels(paymentandvalue$Provider.ID))[paymentandvalue$Provider.ID] %>%
  as.numeric()

Because there are no common identifiers between national and state data (national uses Provider ID and NY state uses Facility ID), we will have to perform our join using hospital names.

#use hospital infection naming convention as that is the closest to the provider naming convention
facilityID_data <- hospitalinfection %>%
  filter(Facility.Id > 0, Year == 2014 ) %>%
  select(Facility.Id, Hospital.Name, Infections.observed) %>%
  group_by(Facility.Id, Hospital.Name) %>%
  summarise(Infections = mean(Infections.observed))

#adjust col names of PCI
colnames(PCIbyhospital)[1] <- "Facility.Id"
colnames(PCIbyhospital)[6] <- "Year"

#only take risk adjusted mortality rate and the latest year
facilityID_data <- PCIbyhospital %>%
  filter(Year == 2011, Procedure == "All PCI") %>%
  select(c(Facility.Id, Number.of.Cases, Risk.Adjusted.Mortality.Rate)) %>%
  right_join(facilityID_data, by = "Facility.Id")


#clean case convention and column name
facilityID_data$Hospital.Name <- toupper(facilityID_data$Hospital.Name)

#use complicationshospital as the conventional naming
providerID_data <- complicationshospital %>%
  select(c(Provider.ID, Hospital.Name))

Now let’s try to join the ProviderID dataset with the FacilityID dataset using the Hospital/Facility Name. First, we need to clean up some of the varying naming conventions between the two datasets. We will use ProviderID as the convention given that is the one used by HCAHPS

#find the hospital names that are slightly different between the datasets

facilityID_data$Hospital.Name <- as.character(facilityID_data$Hospital.Name)
providerID_data$Hospital.Name <- as.character(providerID_data$Hospital.Name)

FacilityNames <- facilityID_data %>%
  select(Hospital.Name, Facility.Id) %>%
  as.data.frame() %>%
  unique()

HospitalNames <- providerID_data %>%
  select(Hospital.Name, Provider.ID) %>%
  as.data.frame() %>%
  unique()

#merge the tables and find where there is unmerged rows
ID_key <- merge(FacilityNames, HospitalNames, by = "Hospital.Name", all = TRUE)

Quickly scanning the list of merged names, we notice that a common difference in convention between the datasets is the use of “INC” vs “, INC”. Let’s update FacilityID from “INC” to “, INC”.

#Update FacilityID's uses "INC" to match ProviderID's ", INC"
facilityID_data$Hospital.Name <- gsub(" INC", ", INC", facilityID_data$Hospital.Name)

#rerun the merged datasets
FacilityNames <- facilityID_data %>%
  select(Hospital.Name, Facility.Id) %>%
  as.data.frame() %>%
  unique()

ID_key <- merge(FacilityNames, HospitalNames, by = "Hospital.Name", all = TRUE)

It also looks like the facility database has separate IDs for different divisions of a hospital provider. Let’s remap all these divisions to the parent provider

#rename Catskill to remove from further cleaning as we don't want to remove the - in this case
facilityID_data$Hospital.Name <- gsub("CATSKILL REGIONAL MEDICAL CENTER - G. HERMANN SITE", "CATSKILL REGIONAL MEDICAL CENTER G HERMANN SITE", facilityID_data$Hospital.Name)

#remove division name following -
facilityID_data$Hospital.Name <- gsub(" - [A-z ]*", "", facilityID_data$Hospital.Name)

#rerun the merged datasets
FacilityNames <- facilityID_data %>%
  select(Hospital.Name, Facility.Id) %>%
  as.data.frame() %>%
  unique()

ID_key <- merge(FacilityNames, HospitalNames, by = "Hospital.Name", all = TRUE)

We still have a couple more manual changes to make

#create table remapping facility name to hospital name
facility_orig = c("ADIRONDACK MEDICAL CENTER-SARANAC LAKE SITE", "BROOKHAVEN MEMORIAL HOSPITAL MEDICAL CENTER INC", "BROOKLYN HOSPITAL CENTER AT DOWNTOWN CAMPUS", "CATSKILL REGIONAL MEDICAL CENTER - G. HERMANN SITE", "CHAMPLAIN VALLEY PHYSICIANS HOSPITAL MEDICAL CENTER", "CLIFTON-FINE HOSPITAL", "FAXTON-ST LUKES HEALTHCARE ST LUKES DIVISION", "FAXTON-ST LUKES HEALTHCARE FAXTON DIVISION", "JOHN T MATHER MEMORIAL HOSPITAL OF PORT JEFFERSON NEW YORK, INC", "MARGARETVILLE HOSPITAL", "MONTEFIORE", "MONTEFIORE MED CENTER", "MONTEFIORE MEDICAL CENTER& LUCY MOSES DIV", "MOUNT VERNON HOSPITAL", "MOUNT ST MARYS HOSPITAL AND HEALTH CENTER", "NEW YORK COMMUNITY HOSPITAL OF BROOKLYN,, INC", "NEW YORK PRESBYTERIAN HOSPITAL", "RIVER HOSPITAL,, INC.", "ROME MEMORIAL HOSPITAL,, INC", "SCHUYLER HOSPITAL", "SOLDIERS AND SAILORS MEMORIAL HOSPITAL OF YATES, INC", "ST JOHNS EPISCOPAL HOSPITAL SO SHORE", "ST JOSEPHS HOSPITAL", "ST JOSEPHS HOSPITAL HEALTH CENTER", "ST JOSEPH'S MC-ST VINCENTS WESTCHESTER DIVISION", "ST LUKE'S CORNWALL HOSPITAL/NEWBURGH", "ST LUKES ROOSEVELT HOSPITAL", "ST LUKES ROOSEVELT HOSPITAL CENTER", "ST PETERS HOSPITAL", "ST. JOSEPH HOSPITAL", "STATEN ISLAND UNIVERSITY HOSP-NORTH", "STATEN ISLAND UNIVERSITY HOSP-SOUTH", "UNITED HEALTH SERVICES HOSPITALS, INC.", "UNITED MEMORIAL MEDICAL CENTER BANK STREET CAMPUS", "UNITED MEMORIAL MEDICAL CENTER NORTH STREET CAMPUS", "UNIVERSITY HOSPITAL", "UNIVERSITY HOSPITAL OF BROOKLYN", "UNIVERSITY HOSPITAL SUNY HEALTH SCIENCE CENTER", "WOMANS CHRISTIAN ASSOC HOSPITAL", "WOODHULL MEDICAL & MENTAL HEALTH CENTER")

facility_new = c("ADIRONDACK MEDICAL CENTER", "BROOKHAVEN MEMORIAL HOSPITAL MEDICAL CENTER", "BROOKLYN HOSPITAL CENTER", "CATSKILL REGIONAL MEDICAL CENTER - G HERMANN SITE", "CHAMPLAIN VALLEY PHYSICIANS HOSPITAL MEDICAL CTR", "CLIFTON FINE HOSPITAL", "FAXTON-ST LUKE'S HEALTHCARE", "FAXTON-ST LUKE'S HEALTHCARE", "JOHN T MATHER MEMORIAL HOSPITAL OF PORT JEFFERSON", "MARGARETVILLE MEMORIAL HOSPITAL", "MONTEFIORE MEDICAL CENTER", "MONTEFIORE MEDICAL CENTER", "MONTEFIORE MEDICAL CENTER", "MONTEFIORE MOUNT VERNON HOSPITAL", "MOUNT ST MARY'S HOSPITAL AND HEALTH CENTER", "NEW YORK COMMUNITY HOSPITAL OF BROOKLYN, INC.", "NEW YORK-PRESBYTERIAN HOSPITAL", "RIVER HOSPITAL, INC", "ROME MEMORIAL HOSPITAL, INC", "SCHUYLER HOSPITAL, INC", "SOLDIERS AND SAILORS MEMORIAL HOSPITAL OF YATES", "ST JOHN'S EPISCOPAL HOSPITAL AT SOUTH SHORE", "ST JOSEPH'S HOSPITAL, INC", "ST JOSEPH'S HOSPITAL HEALTH CENTER", "ST JOSEPH'S MEDICAL CENTER", "ST LUKE'S CORNWALL HOSPITAL", "ST LUKE'S ROOSEVELT HOSPITAL","ST LUKE'S ROOSEVELT HOSPITAL CENTER", "ST PETER'S HOSPITAL", "ST JOSEPH HOSPITAL", "STATEN ISLAND UNIVERSITY HOSPITAL", "STATEN ISLAND UNIVERSITY HOSPITAL", "UNITED HEALTH SERVICES HOSPITALS, INC", "UNITED MEMORIAL MEDICAL CENTER", "UNITED MEMORIAL MEDICAL CENTER", "UNIVERSITY HOSPITAL (STONY BROOK)", "UNIVERSITY HOSPITAL OF BROOKLYN (DOWNSTATE)", "UNIVERSITY HOSPITAL S U N Y HEALTH SCIENCE CENTER", "WOMAN'S CHRISTIAN ASSOCIATION HOSPITAL", "WOODHULL MEDICAL AND MENTAL HEALTH CENTER")

#renaming table
remapped_names <- data.frame("Facility Name" = facility_orig, "Provider Name" = facility_new, stringsAsFactors = FALSE)

#rename the hospitals in the facilityID_data
facilityID_data$Hospital.Name <- sapply(facilityID_data$Hospital.Name, function(x){
  ifelse(x %in% remapped_names$Facility.Name, remapped_names[facility_orig == x, 2], x)
})

#rerun the merged datasets
FacilityNames <- facilityID_data %>%
  select(Hospital.Name, Facility.Id) %>%
  as.data.frame() %>%
  unique()

ID_key <- merge(FacilityNames, HospitalNames, by = "Hospital.Name", all = TRUE) %>%
  na.omit() %>%
  select(-c(Hospital.Name))

We now have a key matching facility ID to provider ID for around 130 hospitals. We can use this later when we are looking to investigate the relationship between metrics from the national datasets (eg HCAHPS, readmission) and the New York datasets (eg infection rates, payments)

Part 1 of Data Exploration: Examining HCAHPS

For HCAHPS, we are only really concerned with the linear score and the star rating. Let’s keep only those metrics and spread the measures so that each measure ID is its own column.

#spread the ratings and the other scores
HCAHPS <- hcahps %>%
  select(Provider.ID, HCAHPS.Measure.ID, Patient.Survey.Star.Rating) %>%
  filter(HCAHPS.Measure.ID == "H_STAR_RATING") %>%
  spread(HCAHPS.Measure.ID, Patient.Survey.Star.Rating)

#only keep the aggregate metrics
HCAHPS <- hcahps %>%
  select(Provider.ID, HCAHPS.Measure.ID, HCAHPS.Linear.Mean.Value) %>%
  filter(HCAHPS.Measure.ID %in% c("H_CLEAN_LINEAR_SCORE", "H_COMP_1_LINEAR_SCORE", "H_COMP_2_LINEAR_SCORE","H_COMP_3_LINEAR_SCORE", "H_COMP_4_LINEAR_SCORE","H_COMP_5_LINEAR_SCORE", "H_COMP_6_LINEAR_SCORE","H_COMP_7_LINEAR_SCORE","H_HSP_RATING_LINEAR_SCORE", "H_QUIET_LINEAR_SCORE")) %>%
  right_join(HCAHPS, by = "Provider.ID")

#clean NAs
HCAHPS[HCAHPS == "Not Available"] = NA
HCAHPS <- na.omit(HCAHPS)

#transform classes
HCAHPS$HCAHPS.Linear.Mean.Value <- as.numeric(as.character(HCAHPS$HCAHPS.Linear.Mean.Value))

Let’s plot these scores and see if there is any relationship to star rating.

HCAHPS %>%
  ggplot(aes(H_STAR_RATING, HCAHPS.Linear.Mean.Value)) +
  geom_boxplot() +
  ylab("Linear Score") +
  xlab("Star Rating") +
  ggtitle("Star Rating vs. Linear Score of Other HCAHPS Metrics")

A quick look at the aggregated scores against star ratings boxplot shows a strong positive relationship between the scores and the star rating. Now, let’s see which of the scores are most predictive of a hospitals star rating.

#find the distance between the scores and star rating
HCAHPS <- HCAHPS %>%
  spread(HCAHPS.Measure.ID, HCAHPS.Linear.Mean.Value) 

HCAHPS_sum <- HCAHPS %>%
  select(-c(Provider.ID))

#let's make the column names friendly
colnames(HCAHPS_sum) <- c("Star_Rating", "Cleanliness", "NurseComm", "DoctorComm", "StaffResponsiveness", "PainMmgt.", "MedicationComm.", "Discharge_Info", "Care_Transition", "Overall_Rating", "Quietness")

#turn the star rating into a number
HCAHPS_sum$Star_Rating <- as.numeric(as.character(HCAHPS_sum$Star_Rating))

#we have to standardize bc star rating is in a different unit
HCAHPS_dist <- as.matrix(HCAHPS_sum) %>%
  scale() %>%
  t() %>%
  dist()

heatmap.2(as.matrix(HCAHPS_dist), 
          margins = c(10,10),
          cellnote= round(as.matrix(HCAHPS_dist),4), 
          notecex=1.0,
          notecol="cyan",
          dendrogram = 'none',
          trace='none',
          key = FALSE)

It looks like nurse communication is the closest to star rating, followed by Medication Communication.

Now, let’s see if we can build a knn model from these additional HCAHPS metrics. First, we will need to determine the best K to use.

set.seed(1)
control <- trainControl(method='cv', number=10, p=.7)

#we will need to center and scale since we are using different units
res_knn <- train(Star_Rating ~ .,
             data = HCAHPS_sum,
             method = "knn",
             trControl = control,
             preProcess = c("center","scale"),
             tuneLength = 1,
             tuneGrid=data.frame(k=seq(1,15,1)))

#plot the results
res_knn$results %>% 
  ggplot(aes(k, RMSE, ymin= RMSE - RMSESD, 
             ymax = RMSE + RMSESD)) +
  geom_point() + geom_errorbar() +
  ggtitle("RMSE over K")

res_knn
## k-Nearest Neighbors 
## 
## 154 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 139, 139, 138, 138, 139, 139, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   RMSE SD     Rsquared SD
##    1  0.3550570  0.8146416  0.16589106  0.11909805 
##    2  0.3249938  0.8633711  0.07835016  0.04686860 
##    3  0.3357165  0.8480199  0.07676584  0.05333952 
##    4  0.3309172  0.8510118  0.08068019  0.05460303 
##    5  0.3211934  0.8605531  0.06832507  0.05108204 
##    6  0.3123021  0.8703599  0.07591906  0.04591637 
##    7  0.3111902  0.8700788  0.07583596  0.05406197 
##    8  0.3055101  0.8743962  0.07269430  0.05395120 
##    9  0.3046925  0.8748050  0.07032935  0.05860115 
##   10  0.3107535  0.8724046  0.06738323  0.05499346 
##   11  0.3112200  0.8742018  0.06260791  0.04959820 
##   12  0.3176261  0.8710675  0.05717181  0.04746055 
##   13  0.3157013  0.8728423  0.05785688  0.04729927 
##   14  0.3219121  0.8690111  0.06167045  0.05196969 
##   15  0.3247508  0.8681348  0.05592365  0.04542577 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was k = 9.

From the above RMSE graph, it looks like a k of 9 will give us the best ratings.

set.seed(1)
#let's create a model with k = 9
inTrain <- createDataPartition(y = HCAHPS_sum$Star_Rating , p = 0.8)
train_set <- slice(HCAHPS_sum, inTrain$Resample1)
test_set <- slice(HCAHPS_sum, -inTrain$Resample1)

#reset factors for test set
test_set$Star_Rating  <- factor(test_set$Star_Rating )

#fit the data
pred <- predict(res_knn, newdata = test_set)
tab <- table(pred = round(pred), truth = test_set$Star_Rating)

confusionMatrix(tab)$overall["Accuracy"]
##  Accuracy 
## 0.8965517

From the above, we see that the remaining metrics in HCAHPS has about a 90% accuracy rating in predicting star ratings.

What happens when we try to replicate this 10 times?

knn_10 <- sapply(1:10, function(i){
  inTrain <- createDataPartition(y = HCAHPS_sum$Star_Rating , p = 0.8)
  train_set <- slice(HCAHPS_sum, inTrain$Resample1)
  test_set <- slice(HCAHPS_sum, -inTrain$Resample1)
  
  test_set$Star_Rating  <- factor(test_set$Star_Rating )
  
  pred <- predict(res_knn, newdata = test_set)
  
  #special treatment to catch corner cases where the number of stars in the prediction don't match the number of stars in the truth (i.e. there is a 4 star in the truth but not in the prediction)
  
  pred = round(pred)
  truth = test_set$Star_Rating
  u = union(pred, truth)
  tab <- table(factor(pred, u), factor(truth, u))
  
  confusionMatrix(tab)$overall["Accuracy"] 
})

mean(knn_10)
## [1] 0.9103448
sd(knn_10)
## [1] 0.04654818

We still have an average accuracy of around 90%, with a standard deviation of around 4%.

Now, we will expand our search outside of the HCAHPS and see if metrics from other databases can help us improve our predictions. Before we link these new metrics to HCAHPS, we will first explore each dataset and examine some additional hypotheses that would impact patient experience.

Part Two of Data Exploration: Other potential predictors initial data analysis

As we have several other potential predictors in our other datasets, we will limit this preliminary data exploration to just those that have cost and mortality metrics, as we assume that these two are most likely to impact patient experience. Additionally, we will consider the hospital case load. Our hypothesis is that hospitals with higher case loads have more experience treating a given disease which may translate to better and more efficient treatment that will lower costs and increase survival.

Cost

We will start our cost examination with the inpatientcost dataset. This dataset allows us to look at cost, controlling for the severity of the injury/procedure. First, let us sanity check this dataset and verify that cost is higher for more severe treatments.

Severity<- inpatientcost %>% 
  filter(APR.DRG.Description=="Heart Failure")

#reorder the factors from minor to extreme
Severity$APR.Severity.of.Illness.Description <- factor(Severity$APR.Severity.of.Illness.Description, levels = c("Minor", "Moderate", "Major", "Extreme"), ordered = TRUE)
  

#boxplot of charge by severity
Severity %>%
  ggplot(aes(APR.Severity.of.Illness.Description, Mean.Charge)) +
  geom_boxplot() +
  ylab("Mean Charge") +
  xlab("Serverity of Illness")

#we will also check the median to avoid issues with skews
Severity %>%
  ggplot(aes(APR.Severity.of.Illness.Description, Median.Charge)) +
  geom_boxplot() +
  ylab("Median Charge") +
  xlab("Serverity of Illness")

As expected, it appears extreme cases have higher charges than the other cases. That being said, interestingly enough, there is not much difference in cost between major and minor treatments.

Let’s see if there is a relationship between discharge and mean cost in this dataset. The idea behind this hypothesis is that hospitals that treat more patients for a certain procedure or disorder may have improved efficiency that allows them to lower the treatment charge.

#plot of mean charge vs number of cases
Severity %>%
  ggplot(aes(Discharges, Mean.Charge)) +
  geom_point() +
  facet_wrap(~APR.Severity.of.Illness.Description, scales = "free_x")

#median
Severity %>%
  ggplot(aes(Discharges, Median.Charge)) +
  geom_point() +
  facet_wrap(~APR.Severity.of.Illness.Description, scales = "free_x")

Unfortunately, it looks like there is no relationship between number of patients seen and cost of treatment. This suggests that, contrary to expectation, hospitals that see more cases of a certain disorder/disease may not actually have an advantage in scalability to reduce costs.

Mortality

We now move on to looking at the relationship between mortality and number of cases.

First, we will re-examine our earlier test of the relationship between cost and number of patients seen.

#clean the data
PV <- paymentandvalue %>%
  select(Provider.ID, Denominator, Payment, Payment.measure.ID, Value.of.care.category) %>%
  na.omit() %>%
  filter(Denominator != "Not Available")

#plot cost and number of cases a hospital sees
ggplot(PV, aes(Denominator, Payment)) + 
  geom_point() +
  facet_wrap(~Payment.measure.ID, scales = "free_x") +
  xlab("Case count")

Again, we see that there is not really a change in cost for a treatment with increased number of cases seen. For pneumonia, we do see some normalization of cost to a little under 15000 once the case count gets beyond 500.

Let’s now see whether value of care, as defined by mortality and payment intersection, is impacted by the number of cases seen.

#update the order
PV$Value.of.care.category <- factor(PV$Value.of.care.category, levels = c("Worse mortality and lower payment", "Worse mortality and average payment", "Worse mortality and higher payment", "Average mortality and lower payment", "Average mortality and average payment", "Average mortality and higher payment", "Better mortality and lower payment", "Better mortality and average payment", "Better mortality and higher payment"), ordered = TRUE)

#let's now check number of cases and value of care
PV %>%
  ggplot(aes(Value.of.care.category, Denominator)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("Number of Cases") +
  xlab("Value of Care") +
  ggtitle("Value of Care vs. Number of Cases")

#let's check heart disease in particular
PV %>% filter(Payment.measure.ID == "PAYM_30_HF") %>%
  ggplot(aes(Value.of.care.category, Denominator)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("Number of Cases") +
  xlab("Value of Care") +
  ggtitle("Value of Care vs. Number of Cases for Heart Failure")

From the boxplots, we can see that number of cases seen does not appear to vary much between average mortality and worse mortality outcomes. However, it does appear that hospitals with better mortality tend to have seen more cases.

Let’s focus on just mortality, ignoring the payment dimension, and observe the impact of cases there.

#bucket the mortality rates, ignoring payment
PV <- PV %>%
  mutate(Care.Cat = ifelse(grepl("Better mortality", Value.of.care.category), "Better Mortality", 
                           ifelse(grepl("Average mortality", Value.of.care.category), "Average Mortality", "Worse Mortality")))

#reorder
PV$Care.Cat <- factor(PV$Care.Cat, levels = c("Worse Mortality", "Average Mortality", "Better Mortality"), ordered = TRUE)

#let's plot a boxplot of the number of cases by group
PV %>%
  ggplot(aes(Care.Cat, Denominator)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylab("Number of Cases") +
  xlab("Mortaity Rate") +
  facet_wrap(~Payment.measure.ID, scales = "free_x")

In ignoring the payment categories and focusing on just mortality, we can see that mortality does seem to be better at hospitals that observe more cases for all three types of illnesses. Let us see if there is enough of a relationship for us to predict mortality rates using number of cases seen by a hospital. We will focus on heart failure.

set.seed(1)

#turn character into numeric
PV_HF <- mutate(PV, Care.Cat = ifelse(grepl("Better Mortality", Care.Cat), 3, ifelse(grepl("Average Mortality", Care.Cat), 2, 1))) %>%
  filter(Payment.measure.ID == "PAYM_30_HF")


#create train and test sets
inTrain <- createDataPartition(y = PV_HF$Care.Cat, p = 0.8)
train_set <- slice(PV_HF, inTrain$Resample1)
test_set <- slice(PV_HF, -inTrain$Resample1)

#fit the data
fit <- glm(Care.Cat ~ Denominator, data=train_set)
pred <- predict(fit, newdata = test_set, type = "response")

#get the confusion matrix
pred = round(pred)
truth = test_set$Care.Cat
u = union(pred, truth)

tab <- table(factor(pred, u), factor(truth, u))

confusionMatrix(tab)$overall["Accuracy"]
##  Accuracy 
## 0.9333333

It looks like just using the number of cases is actually fairly predictive of the mortality rates for heart failure at hospitals. Let’s see if we can replicate this with a more common disorder like pneumonia:

set.seed(1)

#turn character into numeric
PV_PN <- mutate(PV, Care.Cat = ifelse(grepl("Better Mortality", Care.Cat), 3, ifelse(grepl("Average Mortality", Care.Cat), 2, 1))) %>%
  filter(Payment.measure.ID == "PAYM_30_PN")


#create train and test sets
inTrain <- createDataPartition(y = PV_PN$Care.Cat, p = 0.8)
train_set <- slice(PV_PN, inTrain$Resample1)
test_set <- slice(PV_PN, -inTrain$Resample1)

#fit the data
fit <- glm(Care.Cat ~ Denominator, data=train_set)
pred <- predict(fit, newdata = test_set, type = "response")

pred = round(pred)
truth = test_set$Care.Cat
u = union(pred, truth)

tab <- table(factor(pred, u), factor(truth, u))

confusionMatrix(tab)$overall["Accuracy"]
##  Accuracy 
## 0.9032258

Though slightly less than heart failure, pneumonia is still seeing an effect of number of cases on mortality rates. The accuracy for heart failure may be slightly higher owing to the fact that it is more complex and requires more training and experience than treating pneumonia, which is fairly common place.

Let’s see what happens when we leverage another database that has actual mortality rates. We will focus on heart failure for now.

CH <- complicationshospital %>%
  select(Measure.ID, Denominator, Score)

#let's look at the relationship for HF only
CH %>%
  filter(Measure.ID == "MORT_30_HF") %>%
  ggplot(aes(Denominator, Score)) +
  geom_point() +
  geom_smooth() +
  ylab("Heart Failure Mortality Rate") +
  xlab("Number of Cases")

It looks like there is a slight increase in mortality rate when the number of cases increase, but then it drops off as the number of cases continue to grow. Let’s now explore additional measures such as mortality rate for AMI and Strokes.

#let's add on the relationship for AMI and STK
CH %>%
  filter(Measure.ID %in% c("MORT_30_HF", "MORT_30_AMI", "MORT_30_STK")) %>%
  ggplot(aes(Denominator, Score, colour = Measure.ID)) +
  geom_point() +
  geom_smooth()  +
  ylab("Mortality Rate") +
  xlab("Number of Cases")

The drop in mortality rate is even more pronounced for AMI.

Let’s see what happens when we try to fit a regression to number of cases and mortality rate of AMI and HF.

#filter and clean NA
CH_HF <- CH %>%
  filter(Measure.ID == "MORT_30_HF") %>%
  na.omit()
  
CH_AMI <- CH %>%
  filter(Measure.ID == "MORT_30_AMI") %>%
  na.omit()

#apply regression
res_HF <- CH %>%
  filter(Measure.ID %in% c("MORT_30_HF")) %>%
  lm(Score ~ Denominator, data = .) %>%
  tidy(conf.int = TRUE)

res_AMI <- CH %>%
  filter(Measure.ID %in% c("MORT_30_AMI")) %>%
  lm(Score ~ Denominator, data = .) %>%
  tidy(conf.int = TRUE)

kable(res_HF)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 12.0604626 0.1759339 68.551118 0.000000 11.7130270 12.4078982
Denominator -0.0014524 0.0003698 -3.927509 0.000127 -0.0021827 -0.0007221
kable(res_AMI)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 15.2156647 0.1626288 93.56068 0e+00 14.894139 15.5371906
Denominator -0.0031339 0.0006045 -5.18433 7e-07 -0.004329 -0.0019388

From the above, we can see that with each additional case of HF, a hospital’s rate of mortality decreases between 0.1% to 0.2%. For AMI, the rate of decrease is even higher at 0.1% to 0.4%.

Another database that has mortality rating is PCIbyhospital. Let’s see if we can find the same relationship there.

#removing observed mortality rate of 0
PCI_mort <- PCIbyhospital %>%
  filter(Procedure == "All PCI", Hospital.Name != "NYS - All Hospitals", Observed.Mortality.Rate >0) %>%
  select(Number.of.Cases, Observed.Mortality.Rate, Risk.Adjusted.Mortality.Rate)

#plot Risk Adjusted rates
PCI_mort %>%
  ggplot(aes(Number.of.Cases, Risk.Adjusted.Mortality.Rate)) +
  geom_point() +
  geom_smooth(span = .5) +
  xlab("Risk Adjusted Mortality Rate") +
  ylab("Number of PCI Cases")

#plot observed rate
PCI_mort %>%
  ggplot(aes(Number.of.Cases, Observed.Mortality.Rate)) +
  geom_point() +
  geom_smooth() +
  geom_smooth(span = .5) +
  xlab("Observed Mortality Rate") +
  ylab("Number of PCI Cases")

We see a fairly sharp drop in mortality rate that plateaus once the number of cases surpasses 500. This appears to apply to both the observed and risk adjusted mortality rate. Let’s now fit a regression to number of cases against the two rates.

#apply a regression
res_Obs <- PCI_mort %>%
  lm(Risk.Adjusted.Mortality.Rate ~ Number.of.Cases, data = .) %>%
  tidy(conf.int = TRUE)

res_Adj <- PCI_mort %>%
  lm(Observed.Mortality.Rate ~ Number.of.Cases, data = .) %>%
  tidy(conf.int = TRUE)

kable(res_Obs)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 1.1268476 0.0511208 22.042825 0.0000000 1.025939 1.2277568
Number.of.Cases -0.0001226 0.0000428 -2.864481 0.0047006 -0.000207 -0.0000381
kable(res_Adj)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 1.4405224 0.0839115 17.167165 0.00e+00 1.2748866 1.6061581
Number.of.Cases -0.0003081 0.0000702 -4.386258 2.01e-05 -0.0004467 -0.0001694

We see about a similar relationship but at a decreased magnitude of 0.01%-0.04% decrease in mortality rate for each additional case seen.

Part 3 of Data Exploration: Final Model for HCAHPS Star Rating

Now that we have explored the other datasets, we will start examining the relationship between these new metrics and HCAHPS star rating to see if we can improve our model.

Cost and Mortality

First, we will examine cost. Cost is notoriously nebulous and difficult to understand in healthcare. It is nearly impossible to get a quote on the cost of a procedure before going to the hospital. Often patients never see the bill after going to the hospital. So we wondered what effect, if any, cost would have on patient satisfaction. Are patients less likely to be satisfied with an expensive hospital, or perhaps the care and overall experience are better at more expensive hospitals and so they are more likely to get high ratings?

#add cost metrics
cost <- paymentandvalue %>%
  select(c(Provider.ID, Hospital.name, Payment.measure.ID, Denominator, Payment)) %>%
  right_join(HCAHPS, by = "Provider.ID")

#plot our results by case
cost %>% 
  filter(Payment.measure.ID == "PAYM_30_AMI") %>%
  ggplot(aes(H_STAR_RATING, Payment)) +
  geom_boxplot() +
  ggtitle("Payment for AMI vs. Star Rating") +
  xlab("Star Rating")

cost %>% 
  filter(Payment.measure.ID == "PAYM_30_HF") %>%
  ggplot(aes(H_STAR_RATING, Payment)) +
  geom_boxplot() +
  ggtitle("Payment for Heart Failure vs Star Rating") +
  xlab("Star Rating")

cost %>% 
  filter(Payment.measure.ID == "PAYM_30_PN") %>%
  ggplot(aes(H_STAR_RATING, Payment)) +
  geom_boxplot() +
  ggtitle("Payment for Pneumonia vs Star Rating") +
  xlab("Star Rating")

Interestingly, cost does not appear to have a consistent relationship with star rating. Only cost for pneumonia treatment appears to be decreasing with increased star rating.

What about mortality rates?

#select the columns we want relating to mortality
mort <- complicationshospital %>%
  select(c(Provider.ID, Hospital.Name, Measure.ID, Denominator, Score)) %>%
  spread(Measure.ID, Score)
  
mort[mort == "Not Available"] = NA

#join with our HCAHPS dataset   
mort <- HCAHPS %>%
  right_join(mort, by = "Provider.ID")

#plot our results by case
mort %>% 
  select(H_STAR_RATING, MORT_30_AMI) %>%
  unique() %>%
  ggplot(aes(H_STAR_RATING, MORT_30_AMI)) +
  geom_boxplot() +
  ggtitle("AMI Mortality vs Star Rating") +
  xlab("Star Rating")

mort %>% 
  select(H_STAR_RATING, MORT_30_HF) %>%
  ggplot(aes(H_STAR_RATING, MORT_30_HF)) +
  geom_boxplot() +
  ggtitle("Heart Failure Mortality vs Star Rating") +
  xlab("Star Rating")

mort %>% 
  select(H_STAR_RATING, MORT_30_STK) %>%
  ggplot(aes(H_STAR_RATING, MORT_30_STK)) +
  geom_boxplot() +
  ggtitle("Stroke Mortality vs Star Rating") +
  xlab("Star Rating")

mort %>% 
  select(H_STAR_RATING, MORT_30_PN) %>%
  ggplot(aes(H_STAR_RATING, MORT_30_PN)) +
  geom_boxplot() +
  ggtitle("Pneumonia Mortatlity vs Star Rating") +
  xlab("Star Rating")

Much like cost, mortality does not appear to have a consistent relationship with star rating. AMI mortality rates only had a slight decline with increased star rating. Surprisingly, HF mortality rates appear to actually increase with higher star ratings.

How do mortality rates from PCI compare?

#merge mortality info from PCI
PCIbyhospital <- PCIbyhospital %>%
  left_join(ID_key, by = "Facility.Id")

combined_data <- PCIbyhospital %>% 
  select(Provider.ID, Number.of.Cases, Risk.Adjusted.Mortality.Rate) %>%
  na.omit()

combined_data <- mort %>%
  select(Provider.ID, H_STAR_RATING) %>%
  right_join(combined_data, by = "Provider.ID") 

combined_data %>%
  ggplot(aes(H_STAR_RATING, Risk.Adjusted.Mortality.Rate)) +
  geom_boxplot() +
  ggtitle("Risk Adjusted Mortality vs. Star Rating") +
  xlab("Star Rating")

Based on the above boxplots, it looks like there is no real difference in mortality rate between the star ratings, though there is actually a slight increase in the rate for 4 star hospitals. This is similar to our earlier finding on heart failures.

Finally, given that we saw some weak relationship between number of cases seen and mortality, let us take a look at its relationship to star rating.

#we will use number of cases seen from the complications table
cost %>%
  filter(Payment.measure.ID == "PAYM_30_PN") %>%
  ggplot(aes(H_STAR_RATING, Denominator)) +
  geom_boxplot() +
  ggtitle("Number of Cases of Pneumonia vs Star Rating") +
  xlab("Star Rating")

cost %>%
  filter(Payment.measure.ID == "PAYM_30_AMI") %>%
  ggplot(aes(H_STAR_RATING, Denominator)) +
  geom_boxplot() +
  ggtitle("Number of Cases of AMI vs Star Rating") +
  xlab("Star Rating")

cost %>%
  filter(Payment.measure.ID == "PAYM_30_HF") %>%
  ggplot(aes(H_STAR_RATING, Denominator)) +
  geom_boxplot() +
  ggtitle("Number of Cases of HF vs Star Rating") +
  xlab("Star Rating")

There does not appear to be a significant relationship between cases seen and star rating. While it looks like one star hospitals tend to see less cases of pneumonia and AMI, the same cannot be visually seen for heart failure.

Other predictors

Readmission

We hypothesized that hospitals with higher HCAHPS scores would have lower 30 day readmission rates.

We analyzed the rate of patients readmitted for a complaint related to their original admission within 30 days of discharge for patients of all disease types:

readmissionInfection <- filter(readmission, Measure.Name=="READM-30-COPD-HRRP") %>% 
  select(Hospital.Name, Provider.Number, Number.of.Readmissions, Number.of.Discharges)

#convert to correct type
readmissionInfection[, 3] <- as.numeric(as.character( readmissionInfection[, 3] ))
readmissionInfection[, 4] <- as.numeric(as.character( readmissionInfection[, 4] ))
readmissionInfection <- na.omit(readmissionInfection)

readmissionInfection <- mutate(readmissionInfection, readmissionRate= Number.of.Readmissions/Number.of.Discharges)

readmitHCAHPS <- left_join(HCAHPS, readmissionInfection, by= c("Provider.ID" = "Provider.Number")) %>%
  select(-c(Hospital.Name))

#filter by overall score
readmitHCAHPS %>%
  ggplot(aes(H_STAR_RATING, readmissionRate)) + 
  geom_boxplot() + 
  xlab("Star Rating") + 
  ylab("Readmission Rate")

Indeed, there seems to be a trend towards lower 30 readmission rates for hospitals with higher HCAHPS scores.

If our theory is, indeed, true then we would expect that patients with lower readmission rates would rate their discharge instructions and communication about medicines highly:

#visualize the relationship
readmitHCAHPS %>%
  ggplot(aes(H_COMP_5_LINEAR_SCORE, readmissionRate)) + 
  geom_point() + 
  geom_smooth() +
  xlab("Discharge Instructions Score") + 
  ylab("Readmission Rate") + 
  ggtitle("Discharge Information")

readmitHCAHPS %>%
  ggplot(aes(H_COMP_6_LINEAR_SCORE, readmissionRate)) + 
  geom_point() + 
  geom_smooth(span = .5) + 
  xlab("Communication about Medicine Score") +
  ylab("Readmission Rate") + 
  ggtitle("Communication about Medicine")

There seems to be a downward trend in readmission rates as the rating of the discharge instructions and medication communication increase.

Time

Time spent in the emergency department is more commonly being looked at as an indicator of the overall patient experience. We looked at the time it took hospitals to move admitted patients out of the ED to their rooms and compared it with HCAHPS score.

#time in ED
temp <- timely %>%
  filter(Measure.ID=="ED_1b") #Median time from emergency department arrival to emergency department departure for admitted emergency department patients

#add to the readmitHCAHPS
temp <- left_join(temp, readmitHCAHPS, by= "Provider.ID") %>%
  select(Provider.ID, Measure.ID, Score, Sample, readmissionRate, H_STAR_RATING)

#plot the trend against readmission
ggplot(temp, aes(x=temp$readmissionRate, y= temp$Score)) + 
  geom_point() + 
  geom_smooth() +
  xlab("Readmission Rate") + 
  ylab("ED Wait Times (for admitted patients)")

#plot the trend against readmission
ggplot(temp, aes(H_STAR_RATING, temp$Score)) + 
  geom_boxplot() +
  xlab("Readmission Rate") + 
  ylab("ED Wait Times (for admitted patients)")

It appears that as the hospitals with lower HCAHPS star ratings tended to have higher ED wait times.

Complications

#just focus on serious blood clots
BloodClotScore <- complicationshospital %>% 
  filter(Measure.Name=="Serious blood clots after surgery", rm.na=TRUE)

#class type change
BloodClotScore$Score <- as.numeric(BloodClotScore$Score)
BloodClotScore <- BloodClotScore[order(BloodClotScore$Score),]

names(BloodClotScore)[13] <- "Clot.Score"

#add the star rating
CompsHCAHPS <- left_join(BloodClotScore, HCAHPS, by= "Provider.ID") %>%
  na.omit()

#plot the relationship
CompsHCAHPS %>%
  ggplot(aes(H_STAR_RATING, Clot.Score)) + 
  geom_boxplot() + 
  xlab("Star Rating") + 
  ylab("Clot Rate") + 
  ggtitle("Clots vs. Star Rating")

HCAHPS star ratings do not appear to have dramatically different rates of serious complications like blood clot. It doesn’t seem that people are likely to change their evaluation of the care based on complications.

Finally, to round out our examination of various survival metrics, let us also examine infections.

#merge infection info, convert to provider ID using the key
tmp <- hospitalinfection %>%
  filter(Year == 2014) %>%
  select(Facility.Id, Infections.observed) %>%
  group_by(Facility.Id) %>%
  summarise(Infections = sum(Infections.observed)) %>%
  left_join(ID_key, by = "Facility.Id") %>%
  na.omit()

#add star rating
infect_star <- tmp %>%
  select(Provider.ID, Infections) %>%
  right_join(HCAHPS, by = "Provider.ID")

#plot the relationship
infect_star %>%
  ggplot(aes(H_STAR_RATING, Infections)) +
  geom_boxplot() +
  ggtitle("Total Hospital Infections vs. Star Rating") +
  xlab("Star Rating")

It looks like infections from hospitals is slightly lower for the higher 4 star hospitals, but not significantly different for the other hospitals.

Aggregating the findings

From all of the analysis above, it looks like the best candidates for additional predictors of HCAHPS star rating are readmission, time spent in the emergency room, heart attack mortality, and payments for pneumonia.

#create the final dataset
final_dat <- HCAHPS

#add pneumonia payment
final_dat <- paymentandvalue %>%
  filter(Payment.measure.ID == "PAYM_30_PN") %>%
  select(Provider.ID, Payment) %>%
  right_join(HCAHPS, by = "Provider.ID")

#add readmission
final_dat <- readmitHCAHPS %>%
  select(c(Provider.ID, readmissionRate)) %>%
  right_join(final_dat, by = "Provider.ID")

#add heart failure rates
final_dat <- mort %>%
  select(c(Provider.ID, MORT_30_HF)) %>%
  right_join(final_dat, by = "Provider.ID")

#add ED times
final_dat <- temp %>%
  select(Provider.ID, Score) %>%
  right_join(final_dat, by = "Provider.ID") %>%
  select(-c(Provider.ID)) %>%
  na.omit()

#let's make the column names friendly
colnames(final_dat) <- c("ER_Wait", "HF_Mort", "Readmission", "Payment", "Star_Rating", "Cleanliness", "NurseComm", "DoctorComm", "StaffResponsiveness", "PainMmgt.", "MedicationComm.", "Discharge_Info", "Care_Transition", "Overall_Rating", "Quietness")

#turn the star rating into a number
final_dat$Star_Rating <- as.numeric(as.character(final_dat$Star_Rating))

#we have to standardize bc star rating is in a different unit
final_dat_dist <- as.matrix(final_dat) %>%
  scale() %>%
  t() %>%
  dist()

final_dat <- subset(final_dat, select=c(Star_Rating,ER_Wait:Quietness))

#visualize the relationship
heatmap.2(as.matrix(final_dat_dist), 
          margins = c(10,10),
          cellnote= round(as.matrix(final_dat_dist),3), 
          notecex=1.0,
          notecol="cyan",
          dendrogram = 'none',
          trace = 'none',
          key = FALSE)

From the heatmap above, we see that nurse communication and medication communication are still the closet to the star rating. Out of the new predictors we added, mortality rate for heart failures is the closes to star rating.

Now, let’s revisit our knn model

set.seed(1)

#model RMSE over k
control <- trainControl(method='cv', number=10, p=.8)
res_knn <- train(Star_Rating ~ .,
             data = final_dat,
             method = "knn",
             trControl = control,
             preProcess = c("center","scale"),
             tuneLength = 1,
             tuneGrid=data.frame(k=seq(1,15,1)))

#graph the results
res_knn$results %>% 
  ggplot(aes(k, RMSE, ymin= RMSE - RMSESD, 
             ymax = RMSE + RMSESD)) +
  geom_point() + geom_errorbar() +
  ggtitle("RMSE over k")

res_knn
## k-Nearest Neighbors 
## 
## 140 samples
##  15 predictor
## 
## Pre-processing: centered (15), scaled (15) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 127, 126, 126, 125, 125, 126, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   RMSE SD     Rsquared SD
##    1  0.2390977  0.8875989  0.17706536  0.10081678 
##    2  0.2566128  0.9014948  0.08829337  0.05838738 
##    3  0.2376135  0.9180396  0.08102504  0.05021170 
##    4  0.2434689  0.9233081  0.05831299  0.03096274 
##    5  0.2509484  0.9211828  0.04267224  0.02308849 
##    6  0.2660685  0.9101836  0.04012624  0.03211314 
##    7  0.2724755  0.9082243  0.04299217  0.03096219 
##    8  0.2822545  0.9003718  0.05013693  0.03309965 
##    9  0.2842629  0.9026982  0.04819130  0.03070106 
##   10  0.2916938  0.8965214  0.04790772  0.03904921 
##   11  0.2939657  0.8993798  0.04257018  0.03777148 
##   12  0.3001591  0.8966483  0.03976504  0.03380846 
##   13  0.3049867  0.8911890  0.04359609  0.03863477 
##   14  0.3104284  0.8903422  0.04249140  0.03634086 
##   15  0.3114241  0.8893125  0.04138787  0.03372770 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was k = 3.

It looks like the best k value is 3. Let’s run our prediction model 10 times and see if our average results have improved.

#let's create a model with k = 3

knnNew_10 <- sapply(1:10, function(x) {
  inTrain <- createDataPartition(y = final_dat$Star_Rating , p = 0.8)
  train_set <- slice(final_dat, inTrain$Resample1)
  test_set <- slice(final_dat, -inTrain$Resample1)
  
  #reset factors for test set
  test_set$Star_Rating  <- factor(test_set$Star_Rating )
  
  #fit the data
  
  pred <- predict(res_knn, newdata = test_set)
  
  pred = round(pred)
  truth = test_set$Star_Rating
  u = union(pred, truth)
  tab <- table(factor(pred, u), factor(truth, u))
  
  confusionMatrix(tab)$overall["Accuracy"]
})

mean(knnNew_10)
## [1] 0.9740741
sd(knnNew_10)
## [1] 0.03049158

Our average accuracy is now around 97%.

Finally, out of curiosity, let’s see how well our additional predictors do without the other HCAHPS metrics.

set.seed(1)

final_dat2 <- final_dat %>%
  select(c(Star_Rating:Payment))

control <- trainControl(method='cv', number=10, p=.8)
res_knn <- train(Star_Rating ~ .,
             data = final_dat2,
             method = "knn",
             trControl = control,
             preProcess = c("center","scale"),
             tuneLength = 1,
             tuneGrid=data.frame(k=seq(1,5,1)))

res_knn$results %>% 
  ggplot(aes(k, RMSE, ymin= RMSE - RMSESD, 
             ymax = RMSE + RMSESD)) +
  geom_point() + geom_errorbar() +
  ggtitle("RMSE over k")

res_knn
## k-Nearest Neighbors 
## 
## 140 samples
##   4 predictor
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 127, 126, 126, 125, 125, 126, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE       Rsquared    RMSE SD     Rsquared SD
##   1  0.9649469  0.10071298  0.20408459  0.11372680 
##   2  0.8830800  0.12480524  0.15366005  0.13884282 
##   3  0.8706896  0.08112718  0.10248839  0.12793359 
##   4  0.8354573  0.08279195  0.09522166  0.09401944 
##   5  0.8044505  0.12537656  0.09782312  0.12198666 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was k = 5.

From the RMSE over K results above, we can intuit without further modeling that by themselves the additional non-HCAHPS predictors are not particularly powerful factors in star ratings.

Conclusion/Final Analysis

In the wake of the Affordable Care Act the American people expect better quality healthcare at a lower cost. The government encourages hospitals to focus on the patient experience by publishing HCAHPS scores and tying financial incentives to HCAHPS scores. We thought larger centers would be more cost effective but we found that this is not necessarily the case. We also expected hospitals with better survival rates to have better HCAPHS scores, but again this was not true.

Readmission rates were useful in predicting the HCAHPS score. Similarly, according to the heatmap, the score on communication related questions were close to the HCAHPS overall star rating. These two facts taken together suggest that communication may be important in reducing readmission and improving the patient experience. Patients who understand how and when to take their medication, for example, are more likely to take it as prescribed and are, therefore, less likely to come back into the hospital.

HCAHPS metrics surrounding communication were the best predictors of overall HCAHPS score. Hospitals looking to improve their HCAHPS scores should start by improving communication between providers and patients.

Time spent in the emergency room is also an important predictor of star rating. Perhaps a long stay in the emergency department is a symptom of other inefficiencies elsewhere in the hospital. Patients with long stays in the emergency department are likely to be upset and these negative feelings may cast aspersions on the rest of their hospital stay.

Management of heart attacks is very specialized and benefits from a structured approach and from advanced diagnostic and treatment modalities, more so than other areas of medicine (such as pneumonia). It is difficult to treat in the hospital and difficult to manage after discharge, often requiring a large interdisciplinary team. This may be one explanation for why we saw an increase in mortality rate with higher star ratings. Hospitals that are well equipped to treat heart failure may attract more cases and more severe/complex cases because of that very competency.

Using these measures we were able to achieve over 95% accuracy in predicting HCAHPS scores. Interestingly, the non-HCAHPS predictors also had some predictive power, but not to the extent one would expect. It could be that our dataset was too limited and didn’t have the right metrics. Perhaps a focus on non-emergency/severe treatments might have produced different results. We also can’t be sure if the random population surveyed is representative of the population from which we sourced our other quantitative metrics. It is also possible that, like the article referenced earlier, the relationship between patient satisfaction and actual care received is weak. Our result on increased heart failures at higher rated hospitals may be evidence toward that claim.